Clustering the Normalized Compression Distance for Virus Data

نویسندگان

  • Kimihito Ito
  • Thomas Zeugmann
  • Yu Zhu
چکیده

The present paper analyzes the usefulness of the normalized compression distance for the problem to cluster the HA sequences of virus data for the HA gene in dependence on the available compressors. Using the CompLearn Toolkit, the built-in compressors zlib and bzip are compared. Moreover, a comparison is made with respect to hierarchical and spectral clustering. For the hierarchical clustering, hclust from the R package is used, and the spectral clustering is done via the kLine algorithm proposed by Fischer and Poland (2004). Our results are very promising and show that one can obtain an (almost) perfect clustering. It turned out that the zlib compressor allowed for better results than the bzip compressor and, if all data are concerned, then hierarchical clustering is a bit better than spectral clustering via kLines.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A ew Method for Clustering Heterogeneous Data: Clustering by Compression

Nowadays, we have to deal with a large quantity of unstructured data, produced by a number of sources. For example, clustering web pages is essential to getting structured information in response to user queries. In this paper, we intend to test the results of a new clustering technique – clustering by compression – when applied to heterogeneous sets of data. The clustering by compression proce...

متن کامل

Master ’ s Thesis : Virus Data Clustering based on Kolmogorov Complexity

Influenza viruses are probably a major cause of morbidity and mortality world wide. Large segments of the human population are affected every year. In June 2009, World Health Organization declared the influenza due to a new strain of swine origin H1N1 was responsible for the 2009 influenza pandemic. And on June 11, the WHO declared an H1N1 pandemic moving the alert level to phase 6, marking the...

متن کامل

Clustering Heterogeneous Data Using Clustering by Compression

Nowadays, we have to deal with a large quantity of unstructured data, produced by a number of sources. The application of clustering on the World Wide Web is essential to getting structured information in response to user queries. In this paper, we intend to test the results of a new clustering technique – clustering by compression – when applied to heterogeneous sets of data. The clustering by...

متن کامل

Normalized Information Distance is Not Semicomputable

Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program. This practical application is called ‘normalized compression distance’ and it is trivially computable. It is a parameter-free similarity measure based on comp...

متن کامل

A Hybrid Time Series Clustering Method Based on Fuzzy C-Means Algorithm: An Agreement Based Clustering Approach

In recent years, the advancement of information gathering technologies such as GPS and GSM networks have led to huge complex datasets such as time series and trajectories. As a result it is essential to use appropriate methods to analyze the produced large raw datasets. Extracting useful information from large data sets has always been one of the most important challenges in different sciences,...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010